Abstract

This paper presents a pattern-based method that can be used
to infer adjectival scales, such as ⟨lukewarm, warm, hot⟩,
from a corpus. Specifically, the proposed method uses lexical
patterns to automatically identify and order pairs of
scalemates, followed by a filtering phase in which unrelated
pairs are discarded. For the filtering phase, several
different similarity measures are implemented and compared.
The model presented in this paper is evaluated using the
current standard, along with a novel evaluation set, and
shown to be at least as good as the current state of the art.

1 Introduction^1

Adjectival scales are sets of (typically gradable) adjectives
denoting values of the same property (temperature, quality,
difficulty), ordered by their expressive strength (Horn,
1972). A classical example is ⟨decent, good, excellent⟩. In
this paper, I also use the term scale for ordered sets of
non-gradable adjectives, such as ⟨local, national, global⟩.
Scales are ordered such that each adjective is stronger (more
informative) than the one preceding it. In this paper, I
present a corpus-based method that makes use of lexical
patterns to extract pairs of scalemates: adjectives that
occur on the same scale. As we shall see, the nature of the
patterns used to extract the scalemates also gives us a
reliable way of ordering those pairs.

What I will not attempt here is to go beyond scalemates and
try to construct full adjectival scales (though see Section
6.4 for some ideas on how to do so). My interest lies in
detecting differences in informativeness and expressiveness
between adjectives. This is useful e.g. for question
answering and information extraction (de Marneffe et al.,
2010).^2 On a more theoretical level, this paper provides the
first step in determining which expressions might serve as a
stronger alternative to a given adjective. This is useful to
diversify the study of scalar inferences (cf. Doran et al.
2009). Indeed, this paper finds its origin in the study of
scalar diversity (Van Tiel et al., 2014).

^1 All data from this paper is available online at
http://kyoto.let.vu.nl/~miltenburg/public_data/adjectival-scales/

2 Background 

More than twenty years ago, Hatzivassiloglou and McKeown
(1993, henceforth H&M) outlined the first method to
semi-automatically identify adjectival scales, producing
clusters akin to those in (Pantel, 2003). Their model
consists of the following three steps:

1. Extract word patterns. 

2. Compute word similarity measures. 

3. Combine similarities to create clusters of adjectives.

H&M also suggest using tests such as Horn’s (1969) X is ADJ,
even ADJ to identify adjectives that are on the same scale
(henceforth scalemates).^3 However, they rejected this idea
because “such tests cannot be used computationally to
identify scales in a domain, since the specific sentences do
not occur frequently enough in a corpus to produce an
adequate description of the adjectival scales in the domain”
(p. 173). In this contribution, I will show that the advent
of large corpora has made this approach not only feasible,
but also competitive with the current state of the art.

After H&M, early work in sentiment analysis attempted to
classify documents by determining the average polarity
(positivity or negativity) of the words in those documents
(Turney and Littman, 2002). Research in this direction shows
that we can not only obtain clusters of semantically related
adjectives (like H&M do), but we can also determine the
semantic orientation of those adjectives. This work stops
just short of determining the ordering of scalemates in terms
of expressive strength.

^2 Sheinman et al. (2013, 808-814) list more applications.

^3 Similarly, Hearst (1992) later identified hyponyms using
lexical patterns.

Potts (2011) provides both a method to categorize words by
their orientation, and a method to induce scales. These rely
on a data set of online reviews (books, movies, restaurants).
The categorization method works as follows. Following the
same approach as de Marneffe et al. (2010), a regression
model for the distribution of the ratings is computed for
each adjective.^4 Adjectives with a positive correlation with
the ratings are categorized as positive, and vice versa.
Lacking a significant correlation, adjectives are labeled
‘neutral.’ All words are then ordered by the strength of
their coefficients in the regression analysis, after which
related adjectives are clustered together using their
similar-to’s in WordNet (Fellbaum, 1998). These clusters are
taken to correspond to lexical scales. Potts evaluates his
scales on the MPQA subjectivity lexicon (Wilson et al.,
2005). In this dataset words are labeled either ‘strongly
subjective’ or ‘weakly subjective.’ So for each pair of
adjectives a1, a2, the MPQA lexicon can indicate whether a1
is stronger/weaker than a2 or whether both adjectives have
the same score. Comparing his results with the MPQA lexicon,
Potts’ method achieves a 65% accuracy on the stronger/weaker
items.

Although the results discussed above are very interesting,
and certainly deserve further investigation, the focus on
sentiment precludes the study of ‘sentiment-neutral’ scales
(e.g. ⟨optional, necessary, essential⟩). With our
pattern-based method, we provide a more general algorithm
that should be able to identify adjectival scales across the
board.

3 A pattern-based approach 

Our approach is described in the three sections below. First
we describe the basic method, followed by an overview of the
measures we implemented to filter the raw data. Finally, we
provide a motivation for our choice of corpus.

^4 Potts also studies the polarity of adverbs, but these lie
outside the scope of this paper.


3.1 Basic method 

As mentioned in the introduction, we employed 
a pattern-based method to detect adjectival scales 
(cf. Hearst, 1992). We used the following patterns: 

- ADJ1 if not ADJ2
- ADJ1 and perhaps ADJ2
- ADJ1 but not ADJ2
- between ADJ1 and ADJ2
- from ADJ1 to ADJ2
- ADJ1 or at least ADJ2

The patterns are tagged with part-of-speech information.
These patterns tell us which adjectives are likely to be
scalemates, as well as how they are ranked on the scale. In
all except the last pattern, ADJ1 is generally weaker than
ADJ2, so the ordering should be ⟨ADJ1, ADJ2⟩. If a pair
occurs in two different orders, the most frequent order is
kept. On a draw, the pair is discarded.^5
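The extraction and ordering step can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it matches the six patterns with plain regular expressions and a small hand-picked adjective set (ADJECTIVES, a hypothetical stand-in for the part-of-speech tags in the corpus), and it assumes that the last pattern yields the reverse order ⟨ADJ2, ADJ1⟩.

```python
import re
from collections import Counter

# Toy adjective inventory (an assumption; the paper relies on POS tags instead).
ADJECTIVES = {"warm", "hot", "good", "great", "regional", "national"}

# Each entry: (regex with groups a and b, True if a is the weaker adjective).
PATTERNS = [
    (r"\b(?P<a>\w+),? if not (?P<b>\w+)", True),
    (r"\b(?P<a>\w+),? and perhaps (?P<b>\w+)", True),
    (r"\b(?P<a>\w+),? but not (?P<b>\w+)", True),
    (r"\bbetween (?P<a>\w+) and (?P<b>\w+)", True),
    (r"\bfrom (?P<a>\w+) to (?P<b>\w+)", True),
    (r"\b(?P<a>\w+),? or at least (?P<b>\w+)", False),  # assumed reversed
]

def extract_ordered_pairs(sentences):
    """Return the set of (weak, strong) pairs supported by a majority of matches."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for regex, weak_first in PATTERNS:
            for match in re.finditer(regex, text):
                a, b = match.group("a"), match.group("b")
                if a in ADJECTIVES and b in ADJECTIVES and a != b:
                    counts[(a, b) if weak_first else (b, a)] += 1
    ordered = set()
    for (weak, strong), n in counts.items():
        # Keep the most frequent order; a draw discards the pair entirely.
        if n > counts.get((strong, weak), 0):
            ordered.add((weak, strong))
    return ordered
```

Calling extract_ordered_pairs on a list of sentences returns only those pairs whose majority order is unambiguous; pairs that occur equally often in both orders are dropped, as described above.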

3.2 Similarity measures 

The patterns listed above are fairly reliable at identifying
scalemates, but no result is perfect. Therefore, we
implemented three different types of similarity measures to
ensure that the pairs of adjectives are semantically related.

LSA (Deerwester et al., 1990) If two potential scalemates
have a non-negative cosine similarity, they are considered
similar.^6

Shared attributes If two potential scalemates share an
attribute, they are considered similar. We used two sets of
attributes:

- SUMO mappings (Pease et al., 2002). 

- WordNet synset attributes. 

Thesaurus If two scalemates occur in the same 
thesaurus entry, they are considered similar. 
We used the following resources: 

- Lin’s (1998) dependency-based thesaurus. 

- The Moby thesaurus (Ward, 1996). 

- Roget’s thesaurus.^7
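The three types of similarity check can be sketched as below. This is an illustrative sketch only: the data structures (vectors, attributes, entries) are hypothetical stand-ins for the LSA space, the SUMO/WordNet attribute mappings and the thesaurus files, and the function names are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def lsa_similar(a, b, vectors, threshold=0.0):
    """LSA check: similar if the cosine is non-negative (threshold 0.0)."""
    if a not in vectors or b not in vectors:
        return False
    return cosine(vectors[a], vectors[b]) >= threshold

def share_attribute(a, b, attributes):
    """Shared-attribute check: attributes maps a word to a set of labels."""
    return bool(attributes.get(a, set()) & attributes.get(b, set()))

def same_thesaurus_entry(a, b, entries):
    """Thesaurus check: entries is an iterable of sets of words."""
    return any(a in entry and b in entry for entry in entries)
```

Raising the threshold argument makes the LSA check stricter; the default of 0.0 corresponds to the non-negative cosine criterion described above.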

We also implemented two methods to filter the results. These
filters are described below.

Antonymy If two potential scalemates are 
antonyms, they are removed. Antonyms are 
detected: 

^5 There is some room for improvement here. E.g. one could
establish a measure of reliability by demanding that the pair
is ordered the same way in at least 80% of the cases. We will
not pursue this matter here.

^6 We used the TASA model from:
http://www.lingexp.uni-tuebingen.de/z2/LSAspaces/

^7 We used Jarmasz & Szpakowicz’ (2001) HEAD files.



- on the basis of their morphology; pairs of the form
  {A, prefix-A} are considered antonyms iff
  prefix ∈ {il, in, un, im, dis, non-}

- if they are listed in WordNet as such.

Polarity If two potential scalemates do not share the same
polarity in Hu & Liu’s (2004) opinion lexicon, they are
removed.
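The two filters can be sketched as follows. This is a minimal sketch under stated assumptions: wordnet_antonyms and polarity are hypothetical inputs standing in for WordNet's antonym relation and the opinion lexicon, and pairs with a word missing from the polarity lexicon are kept, which is an assumption rather than something stated above.

```python
# Prefix list as given in the text.
PREFIXES = ("il", "in", "un", "im", "dis", "non-")

def morphological_antonyms(a, b):
    """True if one word equals a listed prefix plus the other word."""
    return any(a == p + b or b == p + a for p in PREFIXES)

def keep_pair(a, b, wordnet_antonyms, polarity):
    """Apply the antonymy filter, then the polarity filter.

    wordnet_antonyms: set of frozensets of two words (hypothetical input).
    polarity: dict mapping a word to 'positive' or 'negative' (hypothetical).
    """
    if morphological_antonyms(a, b):
        return False
    if frozenset((a, b)) in wordnet_antonyms:
        return False
    pol_a, pol_b = polarity.get(a), polarity.get(b)
    # Assumption: a pair is only removed when both words have a known polarity.
    if pol_a is not None and pol_b is not None and pol_a != pol_b:
        return False
    return True
```

Note that with the prefix "non-" written with a hyphen, a pair such as ecclesial and nonecclesial is not caught by the morphological check in this sketch.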

3.3 Corpus 

We used the UMBC WebBase corpus (Han et al., 2013, 3 bn
words) to look up the occurrences of the patterns. The corpus
is tagged with part-of-speech data, and its size and scope
make it ideal for our purposes.

4 Results 

We found 32,470 pairs of potential scalemates, containing
16,971 different adjectives. In general, what we see in the
data is that the more patterns a pair occurs in, the more
likely it is that the pair consists of two scalemates. Below
are some of the pairs that occurred in 5-6 different
patterns.

- warm, hot
- regional, national
- regional, global
- difficult, impossible
- weekly, monthly
- unlikely, impossible

Compare these with the pairs below, which occurred in only
one type of pattern. Some of these pairs are indeed
scalemates (e.g. ⟨transitive, symmetric⟩), while others are
clearly antonyms (⟨good, inadequate⟩).^8

- good, inadequate
- interactive, incremental
- affordable, scalable
- damnable, devil-ridden
- transitive, symmetric
- ecclesial, nonecclesial

As Table 1 makes clear, most pairs only occur in one type of
pattern. What this means is that we cannot do without
filtering our results.


Patterns   1        2       3     4    5    6
Pairs      29,593   2,420   336   88   30   3

Table 1: Pairs occurring in n types of patterns.


Table 2 presents the number of pairs that were retrieved for
each similarity measure-filtering combination (third column).
The table shows big differences in the number of results
between the different similarity measures. Whereas we get
1,533 results using Roget’s thesaurus, our LSA-based method
produces nearly ten times as many pairs of scalemates.

^8 One reviewer asks whether the adjectives found in only one
pattern are infrequent. While there are pairs containing two
rare adjectives, most pairs consist of one frequent and one
infrequent adjective, e.g. ⟨ugly, grotesque⟩, but there are
also examples of pairs with two fairly common adjectives
(⟨smart, gifted⟩).

Differences in the number of results are due to two factors:
coverage and lenience. Consider Roget’s thesaurus and LSA.
Roget’s is handcrafted, and has a much lower coverage than
our automatically generated LSA model. As a consequence, LSA
yields a lot more results. Regarding lenience: depending on
the similarity measure, the conditions on ‘being similar’ can
be more or less lenient. Thesaurus-based measures (Moby,
Roget) can be considered strict, demanding near-synonymy. The
SUMO measure, on the other hand, is quite lenient; for
example, it considers any pair of adjectives that could be
considered ‘subjective assessment attributes’ to be similar.
Needless to say, with the SUMO measure we get a lot more
results. In the case of LSA, we can modify the leniency by
raising or lowering the threshold value for the cosine
similarity function. We did not experiment with this
threshold.

4.1 Evaluation procedure 

In previous work, evaluation of semantic scales has been done
in two ways: intrinsically, using the MPQA lexicon (like
Potts 2011), and extrinsically, using the indirect
question-answer pairs (IQAP) corpus (de Marneffe et al.,
2010). An example of an indirect question-answer pair is
given in (1).

(1) A: Advertisements can be good or bad. Was it 
a good ad? 

B: It was a great ad. 

To know whether B’s answer implies ‘yes’ or ‘no,’ it is
necessary to know whether great is better than good or
not.^9 In what follows, we will focus on the intrinsic
evaluation of our results, as our main goal is to get
reliable data. Extrinsic evaluation is left to further
research.

Like Potts (2011), we make use of the MPQA sentiment lexicon.
For all adjective pairs that contrast in strength according
to the lexicon, we check whether our algorithm produces the
correct ordering: ⟨weak, strong⟩. Because the MPQA lexicon is
two-valued, it often occurs that pairs of adjectives have the
same label (i.e. are judged equally subjective). This
contrasts with Potts’ (2011) method, which uses continuous
values, so that two adjectives are rarely judged to be
equally subjective. As a consequence of this, Potts’ model
has an overall accuracy of 26%. We believe that a restriction
of the evaluation set to pairs of adjectives that contrast in
their subjectivity provides a more reliable assessment of the
quality of Potts’ data (and thus 65% accuracy is the score to
beat). Either way, the coarse-grainedness of the MPQA lexicon
is an issue that needs to be taken into account.

^9 de Marneffe et al. (2010) do this in two ways: either
using review data, as Potts (2011) does as well, or using Web
searches. E.g. to answer the question in (i), de Marneffe et
al. searched the Web for ‘warm weather,’ in order to find out
the typical range and distribution of degrees associated with
warm weather.

(i) Q: Is it warm outside?
    A: It’s 25°C

These search results could in theory be compared with those
from other queries, allowing for a ranking of
temperature-related adjectives. Whether this yields good
scales remains to be seen.

Method    Filter     # Pairs   # Test   Score
Raw       None       32,470    2,611    60.90
          Antonyms   30,971    2,565    60.55
          Polarity   30,628    2,090    59.67
          Combined   29,249    2,070    59.42
Lin       None        8,086    1,027    57.84
          Antonyms    7,747      992    57.26
          Polarity    7,393      859    56.00
          Combined    7,149      844    55.57
LSA       None       15,233    1,808    60.56
          Antonyms   14,682    1,767    60.10
          Polarity   14,005    1,463    58.85
          Combined   13,561    1,447    58.53
Moby      None        2,230      287    63.76
          Antonyms    2,172      285    63.86
          Polarity    2,108      268    62.31
          Combined    2,058      267    62.17
Roget     None        1,533      225    62.22
          Antonyms    1,513      224    62.05
          Polarity    1,445      203    59.61
          Combined    1,430      202    59.41
SUMO      None       12,061    1,947    62.25
          Antonyms   11,498    1,904    61.87
          Polarity   10,610    1,548    61.30
          Combined   10,152    1,529    61.02
WordNet   None        1,602      141    70.92
          Antonyms    1,384      114    67.54
          Polarity    1,402       95    69.47
          Combined    1,245       84    66.67

Table 2: Pair counts for each similarity measure, along with
MPQA evaluation scores (percentage correct) for each
similarity measure-filtering method combination.

In addition to the MPQA lexicon, we use psychological arousal
norms (i.e. values indicating how arousing particular words
are), collected by Warriner et al. (2013, henceforth WKB) for
13,915 English lemmas. The (continuous) arousal values range
from 1 (calm) to 9 (aroused). Examples of adjectives with low
arousal values are calm, dull, and quiet. Some arousing
adjectives are ecstatic and exciting. Intuitively, the latter
have more expressive strength, and as such we can use arousal
values as an indication of how scalar expressions should be
ordered: ⟨low, high⟩. Since the WKB data has not been used
before in any test of scalarity, we will also compare both
evaluation measures to assess their reliability.
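Both evaluations follow the same scheme, which can be sketched as below. This is an illustrative sketch only: strength is a hypothetical dictionary mapping a word to a number (for MPQA, e.g. 1 for weakly and 2 for strongly subjective; for WKB, the arousal value), and pairs without a contrast in the lexicon are skipped.

```python
def ordering_accuracy(pairs, strength):
    """Score predicted (weak, strong) pairs against a strength lexicon.

    Returns (correct, tested). Pairs with a missing word or with equal
    strength values are not counted as test items.
    """
    tested = correct = 0
    for weak, strong in pairs:
        if weak not in strength or strong not in strength:
            continue
        if strength[weak] == strength[strong]:
            continue  # no contrast, so the lexicon cannot judge this pair
        tested += 1
        if strength[weak] < strength[strong]:
            correct += 1
    return correct, tested
```

The percentage score is then 100 * correct / tested, computed only when tested is greater than zero.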

4.2 Evaluation 

Table 2 presents general statistics and the results of the
evaluation procedure. The pattern-based method turns out to
have a very high recall, with 32,470 different pairs of
adjectives. Out of all these pairs, 2,611 scalemates have
contrasting subjectivity measures in the MPQA database. 1,590
(60.9%) of these pairs are correctly predicted to be in
⟨weak, strong⟩ order. A two-tailed Fisher’s exact test
reveals that the difference between our results and Potts’
(2011) data is not statistically significant (p=0.1547).^10
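For readers who want to re-run this comparison, a two-tailed Fisher's exact test on a 2x2 table can be sketched in pure Python as below. This is a minimal sketch, not the software used for the paper: it sums the probabilities of all tables that are at most as likely as the observed one, which is one common definition of the two-tailed test, so the result may differ slightly from the reported value.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_pmf(x):
        # Hypergeometric probability of x in the top-left cell.
        return (log_choose(row1, x) + log_choose(row2, col1 - x)
                - log_choose(total, col1))

    observed = log_pmf(a)
    p_value = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_pmf(x)
        if lp <= observed + 1e-7:  # small tolerance for floating-point ties
            p_value += math.exp(lp)
    return min(p_value, 1.0)

# Correct vs. incorrect orderings: 1,590 of 2,611 (this paper) and
# 201 of 308 (Potts), as reported in the text and its footnote.
p = fisher_exact_two_tailed(1590, 2611 - 1590, 201, 308 - 201)
```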

As presented in rows 2-4 for each method, weeding out antonym
pairs and adjectives with opposite polarities does reduce the
number of scalemates our algorithm yields, but it does not
improve the results. However, this was to be expected: it is
not the goal of these filters to improve ordering. Rather
they are meant to exclude pairs of adjectives that are not on
the same positive or negative (sub)scale. A different measure
is needed to assess the quality of the scales. Likewise, we
cannot fully assess which of our different similarity
measures is superior.

^10 Potts achieves 201/308 correct predictions (p. 65).

The WKB evaluation yields slightly lower scores than those
obtained with the MPQA dataset (56-60%). But how reliable are
those scores? To find out, we took the raw scalemates and
compared the orderings predicted by the MPQA and WKB
datasets. It turns out that they agree on only 62% of the
orderings. This is a surprisingly low number, which casts
doubt on the value of these data sets as an individual
evaluation metric for adjectival scales. We made the
evaluation more robust by combining the two evaluation sets,
using only those pairs for which both sets agree on the
order. The scores for our algorithm using this new evaluation
set are given in Table 3.
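The construction of the combined evaluation set can be sketched as follows. This is a hedged illustration: mpqa and wkb are hypothetical dictionaries mapping words to strength values, and only pairs that both lexicons can order are counted as comparable.

```python
def lexicon_order(a, b, strength):
    """Return (weak, strong) according to a lexicon, or None if undecided."""
    if a not in strength or b not in strength or strength[a] == strength[b]:
        return None
    return (a, b) if strength[a] < strength[b] else (b, a)

def combined_gold(pairs, mpqa, wkb):
    """Keep only orderings on which both lexicons agree.

    Returns (agreed_orderings, n_comparable), where n_comparable is the
    number of pairs that both lexicons were able to order.
    """
    agreed, comparable = set(), 0
    for a, b in pairs:
        order_mpqa = lexicon_order(a, b, mpqa)
        order_wkb = lexicon_order(a, b, wkb)
        if order_mpqa is None or order_wkb is None:
            continue
        comparable += 1
        if order_mpqa == order_wkb:
            agreed.add(order_mpqa)
    return agreed, comparable
```

The agreement rate between the two lexicons is len(agreed) / comparable, and the agreed orderings serve as the gold standard against which the predicted pairs are scored.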


Method       # Items   Score
Raw          1288      67.49
Lin           523      68.50
LSA           904      67.32
Moby          132      72.73
Roget         111      72.07
SUMO         1004      68.31
WordNet        66      77.27
Potts 2011     74      58.11

Table 3: Results for the evaluation using only pairs for
which the MPQA and WKB data agreed on the ordering. Scores
are given for the unfiltered data. Filtering generally had a
negative effect on the score of about one percent.

The results for our algorithm on this new evaluation set are
noticeably (around seven percentage points) better than on
either of the datasets alone. How would Potts’ (2011) method
score on the improved evaluation set?^11 We expected that his
approach might fare better here, as his method relies more on
emotion, finding words that express people’s feelings about
certain products. That seems like an ideal match for an
evaluation based on subjectivity and arousal. Our
pattern-based method is more general, and also finds
(sub)scales that are not emotion-related (e.g. the pair
⟨important, crucial⟩). Contrary to our expectations, Potts’
method has a lower accuracy, predicting the correct order 58%
of the time (43/74 items).

^11 Potts’ data is available at
http://web.stanford.edu/~cgpotts/data/wordnetscales/.

5 Similar work

Another pattern-based approach implementing H&M’s ideas is
AdjScales, which uses online search engines to determine
scale order (Sheinman and Tokunaga, 2009; Sheinman et al.,
2013). For each pair {head-word, similar-adjective} in
WordNet, Sheinman and colleagues searched the Web using
patterns similar to ours to see which ordering was more
prevalent. E.g. since (2a) returns significantly more results
on Google than (2b), we may conclude that the ordering should
be ⟨warm, hot⟩. Sheinman et al. show that the precision of
AdjScales is close to native speaker level.

(2) a. warm, if not hot     b. hot, if not warm

The main difference between Sheinman et al.’s work and ours
is that Sheinman and colleagues take adjective pairs from
WordNet to see how they should be ordered, whereas our method
is more agnostic: we use patterns to extract adjective pairs,
and only afterwards do we check whether both adjectives are
related.^12 There are three problems with using WordNet as a
starting point:

1. Not all words are covered by WordNet.^13

2. Not all related adjectives are related in WordNet, e.g.
   ⟨difficult, impossible⟩.

3. It ignores ad-hoc scales (Hirschberg, 1985), made up of
   words that are not typically related.^14

In our approach, the search space is not constrained by any
lexical resource. We simply collect all pairs of adjectives
that occur in one of the patterns. To find related pairs that
aren’t related in WordNet, one can simply choose a different
similarity measure. Ad-hoc scales can be found by looking
through the raw results, or by choosing a lenient similarity
measure.

6 Future research 

Our results are promising, but as the research on adjectival
scales has not received much attention in the literature,
there are still many interesting avenues of research. First,
there is a clear need for gold standard data, as the
available evaluation data is not specifically designed for
this task, and shows clear shortcomings (e.g.
coarse-grainedness, as discussed in Section 4.1, which
precludes the evaluation of scalemates that are very close in
terms of expressive strength). Second, there is the
possibility of extending our work to other languages. Third,
I see a lot of potential in using vector-based approaches to
generate ordered scales from a corpus. I discuss these issues
in turn. Finally, I consider the possibility of constructing
larger scales from our set of scalemates.

^12 Theoretically, we can obtain the same results as Sheinman
et al. by using WordNet’s similar-to relation as a similarity
measure.

^13 One reviewer notes that Sheinman et al. do not intend to
depart from WordNet, but instead order the adjectives already
present in WordNet. With this goal in mind, WordNet’s
coverage is not an issue. But when the goal is to enrich
WordNet, or to build a separate lexical resource, we should
be able to look beyond WordNet’s vocabulary.

^14 This is relevant for researchers in pragmatics, but of
little importance if our goal is to acquire conventional
scales. Still, being too restrictive a priori may ignore
potentially interesting results (cf. problem 2).

6.1 Creating a gold standard 

We need to have a real gold standard containing pairs of
scalemates annotated with their ordering and polarity. This
gold standard should be balanced in terms of emotion-related
scales and other kinds of scales. We believe that the data
generated using our pattern-based method, combined with
Potts’ (2011) data, should provide a good starting point for
building a reliable lexical resource.

After we finished our data analysis, Christopher Potts (p.c.)
shared the results of an online experiment carried out on
Amazon’s Mechanical Turk. Participants were shown a set of
adjective pairs, and asked for each pair to judge whether the
first adjective is stronger, weaker or as strong as the
second adjective. This is exactly the kind of data we need to
evaluate the order of automatically identified scales. Table
4 shows the agreement between our proposed evaluation set
(combining the MPQA and WKB data) and the elicited data,
followed by the results of our algorithm on Potts’ data. We
observe that the combination of the MPQA and WKB data
provides a reasonable estimate of the correct ordering of
adjectival scales, but our algorithm does much better on the
elicited data than on our proposed evaluation set.

There are still two problems with Potts’ data set: (i) it is
limited in size, and (ii) it is based on Potts’ (2011) study
on reviews, and as such is limited in coverage (i.e. it has
no ‘sentiment-neutral’ scales). We are planning to expand the
set of gold standard data in the future.

6.2 Other languages & automation 

In this paper, we have only looked at English scales. How
would one go about extracting scales in other languages?
Could we further automate our algorithm to generate scales
for multiple languages at once? This requires a way to
automatically detect patterns in which pairs of scalemates
are likely to occur. There are two ways of doing so, both
using sets of known scalemates: (i) Sheinman and Tokunaga
(2009) take 10 seed word pairs, and extract only those
patterns that fulfill certain conditions (e.g. appearing with
at least 3 different seed pairs, occurring more than once for
each pair, not being restricted to one meaning domain).
Schulam and Fellbaum (2010) show how this approach can be
applied to German. (ii) Lobanova (2012) takes a probabilistic
approach, estimating the likelihood of patterns to contain
one of the seed pairs. She applies this method to find
patterns likely to contain antonyms, but her approach can
easily be extended to the scale domain. It may also be
fruitful to try a hybrid approach, combining the two.

Our proposed data set
Agreement      6    7    8    9    10
# Test items   63   49   36   28   16
Accuracy       84   88   92   93   100

Pattern-based search
Agreement      6    7    8    9    10
# Test items   40   36   28   23   15
Accuracy       78   83   89   91   93

Table 4: Results for our new evaluation set and our algorithm
on the evaluation data provided by Christopher Potts (p.c.).
The columns correspond to the level of agreement between
participants, i.e. how many participants (out of 10) agreed
on the first adjective being either stronger or weaker than
the second. Making this requirement stricter reduces the
number of test items that we could use, but increases the
precision of our evaluation set and our algorithm.

Once we have scale ordering data for multiple languages, it
should be possible to automatically verify the results
through EuroWordNet (Vossen, 2004), using the Interlingual
Index (ILI): intuitively, corresponding synsets should have
the same ordering relation in all languages.

6.3 Semantic vectors 

Mohtarami et al. (2012) create a semantic vector space with
twelve basic emotions as its dimensions. The position of each
word Wn in this space is determined by the co-occurrence
counts of Wn with words in the synsets of the selected basic
emotion words. The authors use this information to compute
what they call ‘word pair sentiment similarity.’ On the basis
of this similarity measure, words expressing similar emotions
can be clustered together. While the authors do not go into
this, the right ordering of a set of adjectives might be
achieved by maximizing the sentiment similarity between all
neighboring pairs of adjectives within a cluster. Kim and de
Marneffe (2013) provide a more general vector-based method to
order adjectives on a scale. Making use of earlier observed
semantic regularities in neural embeddings (Mikolov et al.,
2013), the authors show how a scale can be generated by
extracting words that are located at intermediate points
between two vectors from antonym pairs. Though the results
(using the IQAP corpus) look promising, the extraction of
scalemates has not yet been done on a larger scale.

6.4 Building larger scales 

A naive way to build scales from pairs of scalemates would be
to chain them together, so that e.g. ⟨lukewarm, warm⟩ and
⟨warm, hot⟩ could be used to form ⟨lukewarm, warm, hot⟩. But
this strategy completely ignores polysemy. Consider the pairs
⟨inexpensive, cheap⟩ and ⟨cheap, rubbish⟩. Together, these
yield the incoherent scale ⟨inexpensive, cheap, rubbish⟩ that
mixes up two dimensions: COST and QUALITY. A solution to this
issue might be to only chain scales if the senses of the
adjectives involved are in the same domain (either verified
through WordNet, or using an automatic sense clustering
algorithm such as CBC (Pantel, 2003)). However, after
crossing that hurdle we run into the problem that scales are
highly context-dependent. It might be best to construct
adjectival scales on the fly: rather than having a stored
list of full-blown scales, build a scale consisting of
adjectives that are relevant to the discourse. A minimal
requirement for such a process is to have pairwise ordering
information for all adjectives involved, which is what our
pattern-based method produces.
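The domain-restricted chaining idea can be sketched as below. This is a minimal sketch under strong assumptions: each ordered pair comes with a hypothetical domain label (which in practice would have to come from WordNet or a sense clustering algorithm), every word has at most one stronger neighbour per domain, and context-dependence is ignored.

```python
from collections import defaultdict

def chain_by_domain(pairs_with_domain):
    """Chain (weak, strong, domain) triples into scales, one domain at a time."""
    by_domain = defaultdict(list)
    for weak, strong, domain in pairs_with_domain:
        by_domain[domain].append((weak, strong))

    scales = {}
    for domain, pairs in by_domain.items():
        successor = dict(pairs)  # weak -> strong; assumes a single successor
        strong_words = set(successor.values())
        # A chain starts at a word that is never the stronger member of a pair.
        starts = [word for word in successor if word not in strong_words]
        chains = []
        for word in starts:
            chain = [word]
            while chain[-1] in successor and successor[chain[-1]] not in chain:
                chain.append(successor[chain[-1]])
            chains.append(chain)
        scales[domain] = chains
    return scales
```

With the pairs from the text, labelling ⟨inexpensive, cheap⟩ as COST and ⟨cheap, rubbish⟩ as QUALITY keeps the two pairs apart, while the two TEMPERATURE pairs are chained into one three-member scale.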

7 Conclusion 

In this paper we have looked at different methods to
automatically find scales or scalemates from a corpus since
H&M’s original paper. Our findings show that a pattern-based
method can be very successful at identifying pairs of
scalemates, as long as the corpus is big enough. This mirrors
findings from Sheinman and Tokunaga (2009) and Sheinman et
al. (2013). One of our contributions is the use of a wide
range of similarity measures as well as an antonymy filter
and a polarity filter to clean up the results.

We have also proposed a new evaluation method, combining the
MPQA subjectivity lexicon with the WKB arousal norms. The
combination of these two data sets makes the evaluation of
scale ordering methods more reliable. This alleviates, but
does not eliminate, the need for a true gold standard, which
could finally enable us to move towards the automatic
identification of adjectival scales.